1 SAMSA2

Westreich et al. 2018. SAMSA2: standalone metatranscriptome analysis pipeline. BMC Bioinformatics
github repo


 

1.1 RefSeq database

• Non-redundant set of reference proteins (predicted+verified)
• Contains species + protein name
• 27G, 68M sequences

1.2 Subsystems database

• SEED is a categorization system which organizes gene functional categories into a hierarchy (4 levels)
• 7.6G, 2.9M sequences

2 Data

• 2016 Lake Champlain algal bloom samples
• Metatranscriptomes (2 X 100bp Illumina HiSeq)
• 3 sampling site: PRM (15 samples), St1 (27 samples), St2 (23 samples)
• 130 fastq samples (79G of fastq.gz data)

3 Running the pipeline

• Compute Canada (Cedar)
• I ran the pipeline seperately for each sampling site
• 48 CPUs, ~15hr X 3 sampling sites

4 Results

4.1 Merging

## [1] "50.95 % of sequences were merged"
## [1] "Does merging make sense? Depending on size selection of amplicons sequenced (should check)"

4.2 Trimmomatic

## [1] "2.491 % of sequences were dropped"

4.3 Ribosomes

## [1] "9.211 % of sequences were ribosomes and removed"
## [1] "Three samples had a high % of ribosomes removed"
##                                                                    V1
## 1 HI.4752.003.NEBNext_Index_13.000576_ChampSt1-20160915-WatPhotz_RNAa
## 2 HI.4752.003.NEBNext_Index_14.000577_ChampSt1-20160915-WatPhotz_RNAb
## 3 HI.4752.003.NEBNext_Index_15.000578_ChampSt1-20160915-WatPhotz_RNAc
##      V9
## 1 57.95
## 2 62.78
## 3 65.08

4.4 Other stats

## [1] "262093: mininum nb sequence annotated per sample (refseq)"
## [1] "19836517: maximum nb sequence annotated per sample (refseq)"
## [1] "1337000: mean nb sequence annotated per sample (refseq)"
## [1] "57071: mininum nb sequence annotated per sample (subsys)"
## [1] "9220714: maximum nb sequence annotated per sample (subsys)"
## [1] "550200: mean nb sequence annotated per sample (subsys)"
## [1] "0.7181: Total Number of paired-end reads (billions)"
## [1] "0.16: Total Number of sequences after cleaning+merging (billions)"
## [1] "5.613e+10% : percentage of sequences annotated (refseq)"
## [1] "2.311e+10% : percentage of sequences annotated (subsys)"

• Note: Diamond threshold (default=0.001: kinda high, but most keep high, given that these are short reads, even if merged…)…
• Note: threshold may not be that important, given that we are more interested in how things change over time…

4.5 RefSeq annotations - species

 
 

• Here I only show spp. that are present >5% in at least one of the samples
• We can see the Dolichospermum bloom
• What is going in the St1 samples on Sept 15th?
Inediibacterium massiliense and Blautia massiliensis are human fecal bacteria…

4.5.1 Environmental data (N,P,Chl)

• Huge peak in values on Sept 9th, 2016.
 
 

4.5.2 I. massiliense

• All I. massiliense 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_053955528.1) in refseq.
e-values ~e-20 (because sequences are small + sequence identity ~90%)

4.5.2.1 I. massiliense genome BLAST

• Download I. massiliense genome.
• BLASTn sequences (for now only 100k sequences) against it.
• 20160915_St1: hits to a single region of ~150bp (sequence identity ~90%, 6206 hits/100k sequences)
• 20160601_St1: hits to a single region of ~150bp (sequence identity ~90%, 31 hits/100k sequences)
 
• Conclusion: sequences may NOT be I. massiliense, but something close and ~200X more abundant in these samples

4.5.3 B. massiliensis

• All B. massiliensis 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_059086336.1) in refseq
e-values ~e-10 (because sequences are small + %identity ~90%)

4.5.3.1 B. massiliensis genome BLAST

• Download B. massiliensis genome (3.5 Mbp)
• BLASTn sequences (for now only 100k sequences) against it.
• Sequences BLAST to several regions of the genome
• 20160915_St1: 34801 hits/100k sequences to several genomic regions
• 20160601_St1: 188 hits/100k sequences

• Same conclusion: may not be B. massiliensis, but something close and much more abundant in these samples (~200X).
 

4.5.4 Are the B. massiliensis & I. massiliense genomic regions the same?

samtools faidx Inediibacterium_massiliense_genome.fasta samtools faidx Inediibacterium_massiliense_genome.fasta NZ_LN876587.1:1637150-1637400 >I_massiliense_contig1.fasta

samtools faidx Inediibacterium_massiliense_genome.fasta samtools faidx Blautia_massiliensis_GD8_NZ_LN913006_WGS.fasta NZ_LN913006.1:2359000-2360200 >B_massiliensis_contig1.fasta

• Yes, if we align the genomics regions for which the sequences matched to (with MUSCLE), the are almost the same (~99% identity of ~150bp)

4.6 RefSeq annotations - species Fold change

 
 

• Here I only show spp. that are present >5% in at least one of the samples
• Replicated (a, b & c) were merged. • We can see the Dolichospermum bloom much better. In addition to the Microcystis + Anabaena sp. blooms

4.7 RefSeq annotations - gene names

• Here I only show spp. that are present >1% in at least one of the samples.

4.8 Subseq annotations - functions

• Here I only show spp. that are present >1% in at least one of the samples (for level 2:4 >-.5% for level 1).
 

4.9 Principal Coordinate Analyses

• Note that the Sept 9th 2016 St1 sites were removed, because they skew the ordinations too much. • Most of the variation is explained by Axis 1. Samples get further away as time goes by.

5 To do (lab meeting Oct 19th):

• What is sp. PCC?
Bug in the SAMSA2 python annotation scripts. Fixed

• Check the Microcystis in the database and in the list of annotated genes?
There are 67425 Microcystis gene products (63.5k are for M. aeruginosa) in the dataset
This is a very good representation

• Get a Microcystis protein set from Olga?
Done, but not sure how usefull, given that it is actually well represented in the dataset

• Check illumina primer/adaptor sequences (find them on nanuq), if they may cause contamination (esp. for the 09-15 samples)?
Done. Refiltered samples as there was indeed some contamination
Didn’t change results very much

• Check sequence similarity bwtn the 2 fecal species proteins
done. There are almost the same…

• Theoretically, you should get a could corellation bwtn annotation to Rubisco + Photosystem and chlorophyll content. I should verify this (simple correlation or can also perfom a more complex RDA)
Not done yet, but I suspect correlation is not great… Why: I don’t know

• Check annotatin at the other three functionnal levels (right know, I only look at the first one…)
Done at all 4 levels

eggnog annotation Not done

• I can do a PCA to look at sample clustering based on communities
Done using phyloseq (Sept 15th samples were removed because they skew the ordinations so much…)

• I can do an RDA to see how certain samples may correlate with environmental data
Not done, as environmental variables are being updated now

• Look at alpha + beta diversity
Not done